In [ ]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='...', project_access_token='...')

Building a Named Entity Recognition Model

This notebook relates to the Groningen Meaning Bank - Modified dataset. The dataset contains tags for parts of speech and named entities in a set of sentences predominantly from news articles and other factual documents. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.

In this notebook, we use the cleaned data file gmb_subset_full_cleaned.csv to generate a collection of new text features and train a simple model to perform named entity recognition (NER). NER, a subtask of information extraction, aims to identify and classify named entities in unstructured text. Entities can be classified into categories such as people, locations, organizations, etc. Each token in the gmb_subset_full_cleaned.csv dataset includes an entity label under the entitytags category. Our goal will be to predict this label by building a simple NER model.

Table of Contents:

  • 0. Prerequisites
  • 1. Read in the Prepared Data
  • 2. Perform Feature Engineering
  • 3. Create CRF Model
  • Authors

0. Prerequisites

Before you run this notebook, complete the following steps:

  • Insert a project token
  • Install and import required packages

Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

  • Click on More -> Insert project token in the top-right menu section

  • This should insert a cell at the top of this notebook similar to the example given above.

    If an error is displayed indicating that no project token is defined, follow these instructions.

  • Run the newly inserted cell before proceeding with the notebook execution below

Import required packages

Install and import the required packages:

  • io
  • pandas
  • eli5
  • sklearn
  • sklearn_crfsuite
In [ ]:
# Installing packages needed for data processing, visualization, and modeling
!pip install numpy pandas scikit-learn sklearn_crfsuite eli5

# Clear output of messy cells
from IPython.display import clear_output
clear_output()
In [ ]:
# Define required imports
import io
import pandas as pd
import eli5
from sklearn.model_selection import train_test_split
import sklearn_crfsuite
from sklearn_crfsuite import scorers, metrics

1. Read in the Prepared Data

We start by reading in the gmb_subset_full_cleaned.csv dataset that was created in the project notebook Part 1 - Data Cleaning.

Note: if you have not yet run that notebook, do so first; otherwise the cells below will not work.

In [ ]:
# Function to load data asset into notebook
def load_data_asset(data_asset_name):
    """
    Loads a data asset 

    :param data_asset_name: filename of desired text data asset
    :returns: data asset as TextIOWrapper object
    """
    
    r = project.get_file(data_asset_name)
    if isinstance(r, list):
        bio = [ handle['file_content'] for handle in r if handle['data_file_title'] ==  data_asset_name][0]
        bio.seek(0)
        return io.TextIOWrapper(bio, encoding='utf-8')
    else:
        r.seek(0)
        return io.TextIOWrapper(r, encoding='utf-8')
In [ ]:
# Read in the gmb_subset_full_cleaned.csv file
tf = load_data_asset('gmb_subset_full_cleaned.csv')
df = pd.read_csv(tf, usecols=['term','postags','entitytags','sentence_id'])

Peek at the newly created dataframe.

In [ ]:
df.head()

2. Perform Feature Engineering

In this section, we generate new features to be later used during modeling. As you can see from above, every token in our dataset is thus far accompanied by its sentence id, its part-of-speech tag, and its entity label. In section 3, we will build a named entity recognition (NER) model which will attempt to predict the entity tag column. To improve the results of our NER model, we will add a set of new features to our dataset so that our model has more information to train on.

2.1 Arrange data structure

In the modeling section, we will train a conditional random fields (CRF) model using the sklearn-crfsuite wrapper. The .fit(X,y) method of our CRF model object takes in:

  • X data as a list of lists of dicts. X will be a list, where each sublist represents a sentence/document, and each sentence is in turn composed of words and their features represented by dictionary objects.
  • y data as a list of lists of strings. y will be a list where each sublist represents a sentence/document composed of strings representing the entity labels of the corresponding word dicts in X.

To create these data structures, we will first group our pandas dataframe by sentence id and convert each grouped sentence into a list of lists. We will then generate our word dicts during the feature creation step.
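
For illustration, here is a minimal, purely hypothetical sketch of these two structures. The words, feature keys, and tags below are invented for readability; the real feature dictionaries are generated in section 2.2.

In [ ]:
# Hypothetical example (not drawn from the dataset) of the structure CRF.fit(X, y) expects
X_example = [
    [  # one sentence: a list of per-word feature dictionaries
        {'word.lower()': 'san', 'word.istitle()': True, 'postag': 'NNP'},
        {'word.lower()': 'francisco', 'word.istitle()': True, 'postag': 'NNP'},
    ],
]
y_example = [
    ['B-GEO', 'I-GEO'],  # one sentence: entity labels aligned with the word dicts above
]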

In [ ]:
# Group words by sentence, and for each grouping/sentence, create a list where each sublist represents a word with its tags
sentences = []
for _, group in df.groupby('sentence_id'):
    sentences.append(group[['term', 'postags', 'entitytags']].values.tolist())

Peek at the list structure of one sentence.

In [ ]:
sentences[0]

2.2 Create features

We will now create several functions that convert our sentences object, which is a list of lists of lists, into the X and y objects required by the fit(X,y) method. We define a function called word2features, which generates additional features for each of our tokens. Because we would like our model to see the adjacent words of each token as it trains, we structure word2features so that it takes in a full sentence (represented as a list of lists) and the word to generate features for (represented as an index into the supplied sentence). This makes it easy to reference adjacent words by simply incrementing or decrementing the supplied index. You can see below that the features we create all depend on either the word itself (e.g. is it uppercase? is it a digit? is it alphanumeric?) or the word's part-of-speech tag.

We also create the sentence2features and sentence2labels functions, which convert our data at the sentence level: sentence2features generates the feature dicts for each word in a sentence by calling word2features, and sentence2labels generates the list of entity labels for each sentence.

In [ ]:
def word2features(sentence, i):
    """
    Generates a feature dictionary for a word using the word's position in a sentence
    Function produced from code from sklearncrfsuite tutorial: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

    :param sentence: a sentence stored as a list of lists
    :param i: the index of the word in sentence to generate features for
    :returns: feature dictionary for the supplied word
    """
    
    word = sentence[i][0]
    postag = sentence[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'word.isalnum()': word.isalnum(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    
    # If i is not the first word in the sentence, add features from the previous word
    if i > 0 and len(sentence) > 1:  
        word1 = sentence[i-1][0]
        postag1 = sentence[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['__START1__'] = True

    # If i is not the last word in the sentence, add features from the next word
    if i < len(sentence)-1:
        word1 = sentence[i+1][0]
        postag1 = sentence[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['__END1__'] = True

    return features

def sentence2features(sentence):
    return [word2features(sentence, i) for i in range(len(sentence))]

def sentence2labels(sentence):
    return [label for token, postag, label in sentence]

Using the functions defined above, we can now create our desired X and y data objects simply by looping through each sentence list in our sentences object. This step may take a few minutes to run because it requires us to loop through all 57,317 sentences.

In [ ]:
# Generate features and labels data 
X = [sentence2features(s) for s in sentences]
y = [sentence2labels(s) for s in sentences]

2.3 Split dataset

Now that we have properly formatted X and y objects, we can use sklearn's train_test_split() function to split our data 80-20 into train and test sets. We set a random state so that our train-test split is reproducible every time we rerun this notebook, since train_test_split() shuffles our data by default.

In [ ]:
# Split the features and labels into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)

Peek at a feature dictionary for one word.

In [ ]:
X_train[0][2]

3. Create CRF Model

In this section, we train a conditional random fields (CRF) model to make entity tag predictions on our test data. The conditional random fields model is a type of discriminative machine learning classifier best suited to predicting sequences. It is often used in natural language processing to tag or identify certain words in a sentence. For the mathematical background behind CRFs, check out this Medium blog post.

We will be using the CRF model to perform named entity recognition (NER). Our trained CRF model will be able to predict a word's entity tag in our test set. Since our entity tags use the inside-outside-beginning (IOB) tagging format, you can see why a model that specializes in predicting sequences is a good choice for our task. When predicting, for example, the entity tags of the words San Francisco, where San should be tagged as B-GEO and Francisco as I-GEO, a model that has learned that the transition B-GEO to I-GEO is a common occurrence would be likelier to tag these words accurately.
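
As a quick, invented illustration of the IOB format: B- marks the first word of an entity, I- marks a continuation of that entity, and O marks words outside any entity. The sentence below is made up and not drawn from the dataset.

In [ ]:
# Hypothetical IOB-labelled sentence (invented example)
words = ['John', 'lives', 'in', 'San', 'Francisco']
tags = ['B-PER', 'O', 'O', 'B-GEO', 'I-GEO']
list(zip(words, tags))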

As already mentioned above, we will be using the sklearn-crfsuite wrapper to train our model. This wrapper allows us to use a fast implementation of the CRF algorithm while still being able to interface with the model using sklearn's model selection utilities.

3.1 Train the model

To train an instance of the sklearn_crfsuite.CRF() class, we initialize a model with a few different parameters. We specify that the model use gradient descent with the L-BFGS method. We also set values for c1 and c2, which control the L1 and L2 terms of elastic net regularization respectively. We set the maximum number of iterations to 100, and we set all_possible_transitions to True so that the model generates all possible entity transition features (including negative weights for transitions that do not occur in our training data). These hyperparameter values were chosen as recommended defaults for this model, but they may be further tuned to improve the model's performance.

In [ ]:
# Create a CRF model

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
    verbose=True
)

We can now fit our model to our training data. This step will take a few minutes to run, but you can follow the cell's printout to monitor the training progress since we set the model to train in verbose mode. A message will print when training has finished. In Watson Studio's free tier Python environment, this step may take up to ~15 minutes, but if you choose to run this notebook with a better hardware configuration, the training time will be greatly reduced. You can check out all available Watson Studio notebook runtime environments here.

In [ ]:
%%time

# Fit the CRF model
print('='*30, 'Begin training model...', '='*30)
crf.fit(X_train, y_train)
print('='*30, 'Training has finished...', '='*30)

3.2 Inspect the model weights & accuracy

Now that we have a trained instance of a CRF model, we can take a closer look at the feature weights it learned, as well as calculate how well it predicts on our test data.

We first use the eli5 package to visualize the CRF model's weights. Below, you can see that the .show_weights() method generates two charts, one for each type of feature set the CRF model learns. The first chart represents the learned transition features. The y-axis represents the from entity and the x-axis the to entity. You can see from the color highlighting which transition pairs have stronger positive or negative weights, or in other words which are more or less likely to occur according to our model. Many of these learned weights seem logical; I entities, for instance, are likely to follow B entities of the same category (e.g. B-ORG -> I-ORG), and are almost as likely to follow I entities of the same category (e.g. I-ORG -> I-ORG). B entities are unlikely to follow B entities of the same category (e.g. B-ORG -> B-ORG). The model also learns which transitions are effectively impossible, as reflected in the very strong negative weights for I entities following O (other) entities.

The second chart below represents the model's state features. For each entity label, we see the top 30 weights corresponding to the features that had the strongest predictive power for that label.

In [ ]:
eli5.show_weights(crf, top=30)

We can zoom in on the most interesting labels by customizing a few of the parameters in our eli5.show_weights() function call. Notice now the off-diagonal feature weights in our transition features chart. It seems that manmade artifacts are likely to follow a geography or an organization, and times are likely to follow a geography. In our state features chart, notice how a word ending in the letters day is a strong predictor that the word is a time entity, and how mr. or vice are likely to predict a person. Also notice how the next word in the sequence being regional predicts that the current word is a geography.

In [ ]:
# Zoom in on just the strongest weights based on observations from above
eli5.show_weights(crf, top=10, targets=['B-ART', 'B-GEO', 'I-GEO', 'B-TIM', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'])

We now use our trained CRF model to predict on our test set and calculate a few common classification metrics that show how well it classifies the entities in this set. We care about precision when there is a higher cost associated with making false positive predictions, and about recall when there is a higher cost associated with making false negative predictions. The f1-score is a metric that balances precision and recall when both false negatives and false positives are equally undesirable. We will use the f1-score to compare this model's performance with the model that you will build in the next notebook.

In [ ]:
# Use trained CRF model to predict on test data and calculate metrics for entity labels
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred))
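
If you would like a single summary number to carry over to the next notebook, one option (a sketch, and a choice you may adjust) is a weighted f1-score computed over the entity labels only, excluding the dominant O class.

In [ ]:
# Weighted f1-score over the entity labels, excluding the dominant 'O' class
entity_labels = [label for label in crf.classes_ if label != 'O']
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=entity_labels)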

Based on the f1-scores for individual entity labels, we can see that our model performs best on O (other) entities, likely because we have the most training data for that category, but also does reasonably well for many of the other entity labels. The model does not predict the B-ART, I-ART, or I-EVE categories very accurately. This may be due to these categories not being represented as heavily as the others are in our training data, as we saw in the previous notebook.
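
If you would like to try improving these scores, the c1 and c2 regularization strengths from section 3.1 can be tuned. The cell below is an optional sketch of a randomized search, following the pattern from the sklearn-crfsuite tutorial; it is slow to run (the fit call is left commented out) and may need adjustment depending on your installed versions of scikit-learn and sklearn-crfsuite.

In [ ]:
# Optional: randomized search over the c1/c2 regularization strengths (sketch only; slow)
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Score candidates by weighted f1 over the entity labels, excluding 'O'
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted',
                        labels=[label for label in crf.classes_ if label != 'O'])

search = RandomizedSearchCV(
    sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100, all_possible_transitions=True),
    param_distributions={'c1': scipy.stats.expon(scale=0.5), 'c2': scipy.stats.expon(scale=0.05)},
    scoring=f1_scorer, cv=3, n_iter=10, verbose=1, n_jobs=-1
)
# search.fit(X_train, y_train)              # uncomment to run; multiplies training time by cv * n_iter
# search.best_params_, search.best_score_   # best hyperparameters and cross-validated score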

Authors

This notebook was created by the Center for Open-Source Data & AI Technologies.

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.
